Abstract
LTR retrotransposons (LTR-RTs) are mobile genetic elements constituting remarkable portions of plant genomes and significantly contribute to genome structure, size and regulations. Some LTR-RTs lineages are distributed uniformly while others are localized in centromeres of monocentric chromosomes. But little is known about characterization and distribution of retrotransposons within genomes with holocentric chromosomes. We will test our hypothesis that the LTR RTs in holocentrics will be evenly distributed along chromosomes. Therefore we designed the comparative study with three groups where each is composed from two holocentric plant species with different karyotypes and one phylogenetically closest monocentric relative. These species will be sequenced using: (1) Oxford Nanopore Technologies (long reads) and (2) Illumina sequencing (short reads). The primary assembly will be generated and subjected to the in-depth LTR-RTs detection analysis. Our investigation will provide information on LTR-RTs families abundance and distribution in relation to the phenomenon of holocentrism.
Project workflow
- Plant species selection: Holocentric plant species are chosen across phylogenetically distant taxons and concurrently within one taxon/genera pick up at least two species with remarkably different karyotype. Each higher holocentric taxon will be chosen its phylogenetically closest monocentric counterparts. Representatives of monocots and eudicots will be included (for plant species list see Table 1).
- Genomic DNA sequencing: We will obtain a partial and/or full-length sequences of LTR retrotransposons from all plant species and a primary genome assembly of plant subset using combination of long (Oxford Nanopore Technologies) and short (Illumina sequencing) reads.
- Subsequent analysis: o in-depth search and annotation of full-length LTR retrotransposons o their distribution in chromosomes (alternatively in assembled scaffolds) o identification of additional sequences carried within retroelements with possible function for the holocentric arrangement – satellites, tandem repeats, retrogenes, eORF, (micro) miRNA
- Based on information about abundance of LTR retrotransposon families and their distribution in our model plant species, we will describe LTR retrotransposon landscape and its relation to holocentrism.
Table 1. Plant species used in the study
gDNA isolation by J. Smerda (2020)
...
Rapid Sequencing Kit (SQK-RAD004) was used for DNA library preparation of first five species
gDNA was isolated using Promegas’ Wizard® HMW DNA Extraction Kit (C. acutiformis, E. uniglumis and L. elegans) and/or using Viktor’s phenol/chloroform based isolation protocol (O. sativa and L. pilosa)
Nanopore ‘pass’ reads statistics
Fastq quality stats of S. cereale 673 fast5 files after fast and hac basecalling
The longest reads are in ‘fastq_fail’
Check on reads from O. sativa run:
O. sativa reads quality distribution
O. sativa reads quality distribution
Dividing fastq reads into qulity based bins:
O. sativa reads quality distribution
O. sativa reads quality distribution
Mapping nanopore reads to O. sativa genome assembly (Osativa_323_v7.0.fa; Ouyang et al., 2007) using minimap2
# creating index from ref:
minimap2 -x map-ont -d osat-ont.mmi Osat.fa
# input fq files:
head fqList.txt
Osat_sub100kSeq_reads_qlt_bin_12_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_15_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_4_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_5_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_7_v2.fastq
Osat_sub100kSeq_reads_qlt_bin_9_v2.fastq
# running minimap2:
for f in $(cat fqList.txt); do n=$(echo $f | sed 's/.fastq//g') && minimap2 -ax map-ont osat-ont.mmi $f > $n'_mapped.sam'; done
# mapping stats in brief:
samtools flagstat Osat_sub100kSeq_reads_qlt_bin_5_v2_mapped.sam
4444 + 0 in total (QC-passed reads + QC-failed reads)
446 + 0 secondary
174 + 0 supplementary
0 + 0 duplicates
1660 + 0 mapped (37.35% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
O. sativa reads quality distribution
O. sativa reads quality distribution
Running sliding windows for quality check through >100kbp long reads:
O. sativa reads quality within sliding windows (read quality mean and median, respectively)
The extra long reads over 100kbp are exclusively in bin with Q up to 4. It means that nucleotide error probability is 10^(-4/10)=0.398.
So far we do not have enough supporting arguments for using of long reads.
Example of genome assembly, polishing, and scaffolding (Choi et al., 2020: “Nanopore sequencing-based genome assembly and evolutionary genomics of circum-basmati rice”)
"After completion of sequencing, the raw signal intensity data was used for base calling using flip flop (version 2.3.5) from Oxford Nanopore Technologies. Reads with a mean qscore (quality) greater than 8 and a read length greater than 8 kb were used and trimmed for adaptor sequences using Porechop (https://github.com/rrwick/Porechop). Raw nanopore sequencing reads were corrected using the program Canu (Koren et al., 2017) and then assembled with the genome assembler Flye (Kolmogorov et al., 2019)
The initial draft assemblies were polished for three rounds using the raw nanopore reads with Racon ver. 1.2.1 (Vaser et al., 2017) and one round with Medaka (https://github.com/nanoporetech/medaka) from Oxford Nanopore Technologies. Afterwards, reads from Illumina sequencing were used by bwa-mem (Li 2013) to align to the draft genome assemblies. The alignment files were then used by Pilon ver. 1.22 (Walker et al, 2014) for three rounds of polishing.
Contigs were scaffolded using a reference genome-guided scaffolding approach implemented in RaGOO (Alonge et al., 2019). Using the Nipponbare genome as a reference, we aligned the circum-basmati genomes using Minimap2 (Li, 2018). RaGOO was then used to order the assembly contigs. Space between contigs was artificially filled in with 100 “N” blocks.
Genome assembly statistics were calculated using the bbmap stats.sh script from the BBTools suite (https://jgi.doe.gov/data-and-tools/bbtools/)."
## Linking to librsvg 2.48.9
Genome assembly pipeline
Script
nano_fq_chop_filt_assembly.sh
#!/bin/bash
# get raw fq list
ls fastq_pass/FAO*fastq > fq_list.txt
# adapter trimming using porechop
mkdir 1_adaptor_trimming_porechop
for fq in $(cat fq_list.txt)
do
n=$(echo $fq | cut -f 2 -d "/" | sed 's/.fastq/_chopped.fastq/g')
porechop -i $fq -o '1_adaptor_trimming_porechop/'$n
done
# gzip chopped fqs
cd 1_adaptor_trimming_porechop
cat *_chopped.fastq >> allFastq.fastq
gzip -c allFastq.fastq > allFq_chopped.fastq.gz
# nanofilt
gunzip -c allFq_chopped.fastq.gz | NanoFilt -q 7 -l 500 --headcrop 75 | gzip > all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz
cd ..
# nanoQC stats
nanoQC -o 2_nanoQC_chopped_stats 1_adaptor_trimming_porechop/all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz
# canu corrected fqs [generates 'canuCorrect.correctedReads.fasta.gz']
mkdir 3_canu_correct
cd 3_canu_correct
cp ../1_adaptor_trimming_porechop/all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz .
canu -correct -p canuCorrect -d assembly genomeSize=10m -nanopore-raw all_choppedReads_nfilt_q7_headCr_75_l500.fastq.gz
cd ..
corrFa='3_canu_correct/canuCorrect.correctedReads.fasta.gz'
# run genome assembly using 4 different tools
# 1.
mkdir 4_flye_correctedReads_assembly
cp $corrFa 4_flye_correctedReads_assembly/.
cd 4_flye_correctedReads_assembly
flye --nano-raw canuCorrect.correctedReads.fasta.gz --out-dir flye_assembly_out --threads 6
cd ..
# 2.
mkdir 5_canu_correctedReads_assembly
cp $corrFa 5_canu_correctedReads_assembly/.
cd 5_canu_correctedReads_assembly
canu -p corrReads -d assembly_corrReads genomeSize=80m -nanopore-corrected canuCorrect.correctedReads.fasta.gz
cd ..
# 3.
mkdir 6_raven_correctedReads_assembly
cp $corrFa 6_raven_correctedReads_assembly/.
cd 6_raven_correctedReads_assembly/.
raven canuCorrect.correctedReads.fasta.gz -t 6 > raven_corrReads_assembly.fasta
cd ..
# 4.
mkdir 7_redbean_wtdbg2_correctedReads_assembly
cp $corrFa 7_redbean_wtdbg2_correctedReads_assembly/.
cd 7_redbean_wtdbg2_correctedReads_assembly
~/Documents/Soft/wtdbg2/wtdbg2.pl -t 6 -x ont -g 100m -o wtdbg2 canuCorrect.correctedReads.fasta.gz
cd ..
printf "XXX FINISHED!!! XXX"
Assembly using four different assemlers:
MUMmer: a system for rapidly aligning entire genomes
~/Documents/Soft/mummer/nucmer -p nucmerAlign IRGSP-1.0_genome.fasta Osat_flye_assembly.fasta
~/Documents/Soft/mummer/dnadiff -d nucmerAlign.delta
~/Documents/Soft/mummer/mummerplot -l nucmerAlign.delta
# filter coordinates for Ribbon visualization
cat out.1coords | cut -f 1,2,3,4,5,6,7,8,9,12,13 > out_for_ribbon.1coords
cat out.mcoords | cut -f 1,2,3,4,5,6,7,8,9,12,13 > out_for_ribbon.mcoords
cat out.mcoords | cut -f 1,2,3,4,5,6,7,8,9,12,13 | head
1244 12042 10602 21384 10799 10783 95.93 43270923 53731 chr01 contig_4247
12145 44923 21384 53731 32779 32348 97.93 43270923 53731 chr01 contig_4247
27895 27978 58454 58537 84 84 100.00 43270923 169368 chr01 contig_1987
44125 46854 2665 1 2730 2665 90.39 43270923 139529 chr01 contig_3868
46353 54453 243025 250984 8101 7960 92.76 43270923 405317 chr01 contig_1301
50824 66551 51959 36442 15728 15518 97.01 43270923 51959 chr01 contig_4723
66383 66678 219651 219359 296 293 86.84 43270923 233972 chr01 contig_1596
67259 103694 35777 1 36436 35777 96.41 43270923 51959 chr01 contig_4723
98493 103348 37 4703 4856 4667 90.33 43270923 50883 chr01 contig_1903
118081 127998 82173 72748 9918 9426 88.06 43270923 146834 chr01 contig_1124
O. sativa flye assembly aligned to nipponbare IRGSP-1.0 genome - dotPlotly overview
Visualization in Ribbon
O. sativa flye assembly aligned to nipponbare IRGSP-1.0 genome - overview
O. sativa flye assembly aligned to nipponbare IRGSP-1.0 genome - zoom of long contig
O. sativa query statistics of Mummer mapping to nipponbare reference
O. sativa dotPlotly alignment to nipponbare reference
te-g-nester
te-g-nester performance on O. sativa genome and our test-assemblies:
te-g-nester performance on test-assemblies from holocentric genomes:
RepeatModeler (v2.0.1) and RepeatMasker (v4.1.1) are run on all Flye assembly fasta sequences Metacentrum offer module with
repeatmodeler-1.0.11version without specific LTR retrotransposon determination
RepeatModeler2 flow diagram
#!/bin/bash
# input fasta
fa="canuCorrect.correctedReads.fasta"
# RepeatModeler - build DB
~/Documents/Soft/RepeatModeler-2.0.1/BuildDatabase -name db $fa
# RepeatModeler - run
nohup ~/Documents/Soft/RepeatModeler-2.0.1/RepeatModeler -database db -pa 8 -LTRStruct &> run.out
# Annotated species-specific repeat library
consFa=$(find -name "consensi.fa.classified")
# RepeatMasker - run
RepeatMasker -pa 8 -engine rmblast -gff -lib $consFa $fa
Repetitive DNA detected using RepeatModeler-RepeatMasker in Flye assemblies of all species
DNA or RNA modification probabilities information from basecalled FAST5 datasets strong correlation between in fast5 detected modification and bisulphite methylation from NGS sequencing
Nanopore/Bisulphite methylation frequency correlation (from Modified Base Tutorial)
Nanopore/Bisulphite methylation frequency correlation
guppy_basecaller on ‘fast5_pass’ data using
dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfgmodPhred~/src/modPhred/run -f ref/Os-Nipponbare-Reference-IRGSP-1.0_GCF_001433935.1.fna -o modPhred/ -i workspace/ -t6~/src/modPhred/src/mod_plot.py --scatter -i modPhred/mod.gz~/src/modPhred/src/mod_correlation.py -i modPhred/mod.gz
Schematic overview of the functionalities of each of the 4 modules included in modPhred: (i) modEncode, (ii) modAlign, (iii) modReport and (iv) modAnalysis.
My storage path and address for scp:
pavel@astat:/mnt/hdd_4TB/MinION_outputs/Osat_1/CTAB_Viktor_1/20200917_0715_MN34570_FAO01582_9d0828e0$ scp osat_fast5_pass.tar.gz storage-du-cesnet.metacentrum.cz:/tape_tape/archive/VO_metacentrum/home/pj_narutro/modPhred/osat_modPhred/.
pavel@debian-10-buster:~$ ssh pj_narutro@radegast.ueb.cas.cz
(BUSTER)pj_narutro@radegast:~$ cd /storage/du-cesnet/tape_tape/archive/VO_metacentrum/home/pj_narutro/modPhred/ecol_modPhred_test/
Scratch job script:
scratchJob_guppyGpu_441_damDcmCpg_lpil.sh
Run a scratch job:
(BUSTER)pj_narutro@radegast:~$ qsub -q gpu -l select=1:ncpus=8:ngpus=1:mem=12gb:scratch_local=10gb scratchJob_guppyGpu_441_damDcmCpg_lpil.sh
Reference fasta:
- publicly available genome assembly
- de novo assembly
- chromosome or its part specified by start-end positions
- nanopore canu-corrected reads
~/src/modPhred/run -f ref/Os-Nipponbare-Reference-IRGSP-1.0_GCF_001433935.1.fna -o modPhred/ -i workspace/ -t6~/src/modPhred/src/mod_plot.py --scatter -i modPhred/mod.gz~/src/modPhred/src/mod_correlation.py -i modPhred/mod.gz
outputs
$ tree modPhred/
modPhred/
├── correlations
│ ├── NC_029256.1.csv.gz
│ ├── NC_029256.1.csv.gz.svg
│ ├── NC_029257.1.csv.gz
│ ├── NC_029257.1.csv.gz.svg
│ ├── NC_029258.1.csv.gz
│ ├── NC_029258.1.csv.gz.svg
│ ├── NC_029259.1.csv.gz
│ ├── NC_029259.1.csv.gz.svg
│ ├── NC_029260.1.csv.gz
│ ├── NC_029260.1.csv.gz.svg
│ ├── NC_029263.1.csv.gz
│ └── NC_029263.1.csv.gz.svg
├── minimap2
│ ├── workspace.bam
│ ├── workspace.bam.bai
│ ├── workspace.bam.bed
│ └── workspace.log
├── mod.bed
├── mod.gz
├── mod.gz.svg
├── modPhred.pkl
├── osatModPhred_igv_snapshot1.png
├── osatModPhred_igv_snapshot2.png
├── osatModPhred_igv_snapshot3.png
└── plots
├── basecall_accuracy.svg
├── depth.svg
├── median_mod_prob.svg
└── mod_frequency.svg
3 directories, 27 files
###
$ head mod.bed
NC_029256.1 2828345 2828346 5mC 967 + 2828345 2828346 0,0,151 27 59
NC_029256.1 2828401 2828402 5mC 850 + 2828401 2828402 0,0,59 26 23
NC_029256.1 2828464 2828465 5mC 967 + 2828464 2828465 0,0,186 26 73
NC_029256.1 2828494 2828495 5mC 817 + 2828494 2828495 0,0,41 25 16
NC_029256.1 2828590 2828591 5mC 1000 + 2828590 2828591 0,0,128 28 50
NC_029256.1 2828597 2828598 5mC 583 + 2828597 2828598 0,0,39 26 15
NC_029256.1 2828626 2828627 5mC 1000 + 2828626 2828627 0,0,235 25 92
NC_029256.1 2828633 2828634 5mC 800 + 2828633 2828634 0,0,104 27 41
NC_029256.1 2828637 2828638 5mC 833 + 2828637 2828638 0,0,143 25 56
NC_029256.1 2828652 2828653 5mC 800 + 2828652 2828653 0,0,61 25 24
$ cut -f 4 mod.bed | sort | uniq -c
38873 5mC
385 6mA
Mapping coverage of O. sativa genome
Nipponbare IRGSP-1.0; GCF_001433935.1using minimap2
QC plots
modPhred QC plots
IGV visualization
O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0
O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0
O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0 (zoom in)
O. sativa: NC_029266.1 Chromosome 11 Reference IRGSP-1.0 (zoom in)
Modifications in co-ocurrence heatmap
O. sativa, chromosome 10 and 11 modifications heatmap.
Comparison of nanopore predicted dna methylations with Illumina data from bisulfite conversion
Bisulfite NGS data processing using Bismark tool
I have obtained over 5M uniq methylated bases accross O. sativa genome (n=5114321; CpG, CHG, CHH) modPhred detected 38873 (5mC) + 385 (6mA) = 39258; i.e. ~ 130.3x lower detection.
# SRAfile download and extraction
$ wget https://sra-download.ncbi.nlm.nih.gov/traces/sra44/SRR/012798/SRR13105618
$ ~/Documents/Soft/sratoolkit.2.4.1-ubuntu64/bin/fastq-dump --split-files SRR13105618
# fastq files subsetting
$ seqtk sample -s100 SRR13105618_1.fastq 1000000 > SRR13105618_1M_subset_1.fastq
$ seqtk sample -s100 SRR13105618_2.fastq 1000000 > SRR13105618_1M_subset_2.fastq
# run Bismark perl script
$ ~/Documents/Soft/Bismark/bismark_genome_preparation --path_to_aligner /usr/bin/ --verbose ref/
$ ~/Documents/Soft/Bismark/bismark --parallel 4 --genome ref/ -1 SRR13105618_1M_subset_1.fastq -2 SRR13105618_1M_subset_2.fastq
$ ~/Documents/Soft/Bismark/deduplicate_bismark --bam SRR13105618_1M_subset_1_bismark_bt2_pe.bam
$ ~/Documents/Soft/Bismark/bismark_methylation_extractor --gzip --bedGraph SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.bam
$ ~/Documents/Soft/Bismark/bismark2report
$ tree bismark_methylation_extractor_outputs/
bismark_methylation_extractor_outputs/
├── CHG_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CHG_OT_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CHH_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CHH_OT_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CpG_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── CpG_OT_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
├── extrList.txt
├── osat_bisulfite_methBases_forR.tsv
└── osat_bisulfite_methBases_subsetDiv250_forR.tsv
$ head bismark_methylation_extractor_outputs/CHG_OB_SRR13105618_1M_subset_1_bismark_bt2_pe.deduplicated.txt
Bismark methylation extractor version v0.23.0
SRR13105618.93013374_93013374_length=150 + NC_029260.1 11570272 X
SRR13105618.93013374_93013374_length=150 + NC_029260.1 11570265 X
SRR13105618.93013374_93013374_length=150 - NC_029260.1 11570250 x
SRR13105618.93013374_93013374_length=150 - NC_029260.1 11570247 x
SRR13105618.93013374_93013374_length=150 + NC_029260.1 11570234 X
SRR13105618.93013374_93013374_length=150 + NC_029260.1 11570219 X
SRR13105618.93013374_93013374_length=150 + NC_029260.1 11570213 X
SRR13105618.93013374_93013374_length=150 - NC_029260.1 11570210 x
SRR13105618.93013374_93013374_length=150 - NC_029260.1 11570191 x
$ for f in $(cat extrList.txt); do pref=$(echo $f |cut -f 1 -d 'S'| cut -f 1,2 -d "_") && grep -v "Bismark\|NW" $f | grep -P "\t[HXZ]" | cut -f 3,4 | sort | uniq | sed -r "s/$/\t$pref\tbisulfite/g"; done >> osat_bisulfite_methBases_forR.tsv
$ head bismark_methylation_extractor_outputs/osat_bisulfite_methBases_forR.tsv
NC_011033.1 212370 CHG_OB bisulfite
NC_011033.1 213507 CHG_OB bisulfite
NC_011033.1 228292 CHG_OB bisulfite
NC_011033.1 228410 CHG_OB bisulfite
NC_011033.1 240374 CHG_OB bisulfite
NC_011033.1 240381 CHG_OB bisulfite
NC_011033.1 240424 CHG_OB bisulfite
NC_011033.1 240440 CHG_OB bisulfite
NC_011033.1 240450 CHG_OB bisulfite
NC_011033.1 240463 CHG_OB bisulfite
$ wc -l bismark_methylation_extractor_outputs/osat_bisulfite_methBases_forR.tsv
5114321 bismark_methylation_extractor_outputs/osat_bisulfite_methBases_forR.tsv
$ python3 modPhre_bisulfite_intersects.py mod.bed osat_bisulfite_methBases_forR.tsv
mp_cnt_tot 39258
mp_cnt_match 7009
mp_not_in_bs 277
|
> Methylated base positions within TEs
Coverage Equation The Lander/Waterman equation is a method for computing coverage.
The general equation is:
C = LN / G
- C stands for coverage
- G is the haploid genome length
- L is the read length
- N is the number of reads
So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×106)/(3×109 bp) = 6.3 This tells us that each base in the genome will be sequenced between six and seven times on average.
…
Genomic DNA of all plant species were sequenced using Illumina miSeq. Load of gDNA were adjusted proportionally to their genome size.
RepExp outputed repeats were divided into:
Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
300000 read pairs per plant 221bp read length for all reads
RE - comparative analysis summary
Repeat proportion
> head(t_df_noZero)
LPIL LELE OSAT SCER CACU CPAP EUNI DCAP DFIL NVEN DLUS RACE RALP CLUT PQUA
1|10|plastid 3294 441 58 43 36 122 22 141 43 109 33 33 255 265 22
1|11|plastid 9 4 58 48 2 3 0 671 212 628 220 154 1385 1357 164
1|16|plastid 821 107 135 67 130 323 40 568 172 240 81 74 695 523 61
1|18|plastid 139 22 34 23 675 256 14 1534 559 78 30 19 162 152 15
1|21|plastid 703 79 107 68 82 249 22 286 87 253 64 83 591 504 83
1|23|plastid 717 99 97 85 124 260 39 132 53 239 92 67 555 498 80
PCA analysis - all of 268 superclusters
no plastids
Plastids
All
Angela
Athila
SIRE
Tekay
rDNA